The dataset (fictional) comprises scores from three tests of students at a public school, as well as personal and socio-economic factors that may have interaction effects. The purpose of collecting this data is to identify potential relationships between personal and socio-economic factors and test scores. By understanding which factors have the most significant impact on test scores, educators, parents, and students can focus their attention on variables that could potentially help students earn higher scores. This information can be used to inform educational policies and practices, as well as to develop targeted interventions to support students who may be struggling academically. Ultimately, the goal of collecting this data is to improve educational outcomes and promote academic success for all students.
Student education and testing scores are critical to their future success. Education provides the foundation upon which students will build their future, and improving the efficiency of learning and testing experiences can have a positive impact on many students' futures. By analyzing the dataset, we can identify factors that influence student performance and develop strategies to improve academic outcomes.
Success in the data mining process can be assessed by whether we are able to uncover patterns and/or significant relationships that help us better understand the factors behind student performance. This could include identifying significant predictors of test scores, understanding the impact of certain attributes on student scores, and so on. By evaluating the results of our analyses in light of the original research questions, we can determine whether we have gained useful insights that can inform educational policies and practices and promote academic success for all students.
To measure the effectiveness of a good prediction algorithm, we need to evaluate its accuracy in predicting test scores and its interpretability in identifying the factors that influence those scores. Accuracy can be measured using standard metrics such as mean squared error or root mean squared error, which compare the predicted test scores to the actual scores. However, accuracy alone is not sufficient to evaluate the effectiveness of a prediction algorithm. We also need to consider the interpretability of the results, which requires us to examine the relationship between the input variables and the predicted test scores.
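As a concrete illustration of the accuracy side of this evaluation, RMSE can be computed directly from predicted and actual scores. A minimal sketch (the score values and the `rmse` helper name here are hypothetical, not taken from the dataset):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical actual vs. predicted math scores
actual = [67, 80, 55, 90]
predicted = [70, 78, 60, 85]
print(rmse(actual, predicted))
```

Because RMSE is in the same units as the scores (points out of 100), it is directly interpretable as a typical prediction error.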
Our team will be performing exploratory data analysis with an emphasis on visualization and data processing using the dataset found on Kaggle. The EDA will consist of statistical summaries of attributes, visualizing attributes, and providing interpretations and conclusions based on the visualization and data processing.
# Needed Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline
import seaborn as sns
from tabulate import tabulate
# remove "future" warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
df = pd.read_csv("Expanded_data_with_more_features.csv")
# select all columns except the first column (index)
df = df[df.columns[1:]]
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30641 entries, 0 to 30640
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Gender               30641 non-null  object
 1   EthnicGroup          28801 non-null  object
 2   ParentEduc           28796 non-null  object
 3   LunchType            30641 non-null  object
 4   TestPrep             28811 non-null  object
 5   ParentMaritalStatus  29451 non-null  object
 6   PracticeSport        30010 non-null  object
 7   IsFirstChild         29737 non-null  object
 8   NrSiblings           29069 non-null  float64
 9   TransportMeans       27507 non-null  object
 10  WklyStudyHours       29686 non-null  object
 11  MathScore            30641 non-null  int64
 12  ReadingScore         30641 non-null  int64
 13  WritingScore         30641 non-null  int64
dtypes: float64(1), int64(3), object(10)
memory usage: 3.3+ MB
| Feature | Description | Data Type | Value Type (Scale) |
|---|---|---|---|
| Gender | Gender of the Student | Object (Categorical) | Male/Female |
| EthnicGroup | Ethnicity of the Student | Object (Categorical) | Group A to E |
| ParentEduc | Parent(s) Educational Background | Object (Categorical) | Some High School to Master's Degree |
| LunchType | School Lunch Type | Object (Categorical) | Standard or Free/Reduced |
| TestPrep | Test Preparation Course Completed | Object (Categorical) | Completed or None |
| ParentMaritalStatus | Parent(s) Marital Status | Object (Categorical) | Married/Single/Widowed/Divorced |
| PracticeSport | How Often the Student Practices Sport(s) | Object (Categorical) | Never/Sometimes/Regularly |
| IsFirstChild | First Child of the Family | Object (Categorical) | Yes/No |
| NrSiblings | # of Siblings | Float64 (Discrete) | 0 - 7 |
| TransportMeans | Transportation | Object (Categorical) | School Bus or Private |
| WklyStudyHours | # of Hours Study Weekly | Object (Categorical) | Less than 5 hours; Between 5 and 10 hours; More than 10 hours |
| MathScore | Math Test Score | Int64 (Discrete) | 0 - 100 |
| ReadingScore | Reading Test Score | Int64 (Discrete) | 0 - 100 |
| WritingScore | Writing Test Score | Int64 (Discrete) | 0 - 100 |
Ultimately, there can be numerous reasons for so many missing values, especially since we have limited knowledge of the data-gathering and study-design process.
Because we are unsure of the true cause, it is best to investigate the data itself to see whether we can uncover characteristics of the missingness that will help guide our decision on how to handle it.
#missing data heatmap
plt.figure(figsize=(12,12), dpi = 300)
sns.heatmap(df.isnull(), cmap = "YlGnBu", cbar = False)
plt.show()
# Checking Missing Data
df.isnull().sum()
Gender                    0
EthnicGroup            1840
ParentEduc             1845
LunchType                 0
TestPrep               1830
ParentMaritalStatus    1190
PracticeSport           631
IsFirstChild            904
NrSiblings             1572
TransportMeans         3134
WklyStudyHours          955
MathScore                 0
ReadingScore              0
WritingScore              0
dtype: int64
# Percentage of missing data
100 * df.isnull().sum() / len(df)
Gender                  0.000000
EthnicGroup             6.005026
ParentEduc              6.021344
LunchType               0.000000
TestPrep                5.972390
ParentMaritalStatus     3.883685
PracticeSport           2.059332
IsFirstChild            2.950295
NrSiblings              5.130381
TransportMeans         10.228126
WklyStudyHours          3.116739
MathScore               0.000000
ReadingScore            0.000000
WritingScore            0.000000
dtype: float64
def percent_missing(df):
'''
INPUT
df - dataframe
OUTPUT
percent_nan - percentage of nulls in dataframe
'''
percent_nan = 100* df.isnull().sum()/len(df)
percent_nan = percent_nan[percent_nan > 0].sort_values()
return percent_nan
percent_nan = percent_missing(df)
# plot of missing data (percentage)
plt.figure(dpi = 150)
sns.barplot(x=percent_nan.index, y = percent_nan)
plt.title("Bar Plot of Missing Data (%)")
plt.xlabel("Explanatory Variables")
plt.ylabel("Percent Missing (%)")
plt.xticks(rotation = 45);
After investigating our data, and under the assumption that our data are missing completely at random (MCAR), there are a number of ways for us to proceed:
1) Discard observations with any missing values.
2) Rely on a learning algorithm to deal with missing values in the training phase.
3) Impute all missing values before training.
The two options our group discussed:
Option 1 would effectively remove 37.2% of the observations in our data frame. This is a substantial loss of information, and we ultimately decided against it.
Option 3 (imputation) makes sense given that the data appear to be MCAR: it allows us to keep that 37.2% of our data, effectively preserving our sample size, and helps us maintain statistical power and precision. It should be noted, however, that while imputation is very useful for handling missing values, it involves tradeoffs: filling many cells with a single value (such as the mode) understates variability and can distort the observed distributions and associations.
# mode imputation: fill each column's NaNs with that column's most frequent value
df = df.apply(lambda x: x.fillna(x.value_counts().index[0]))
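The lambda above fills each column with its most frequent value. An equivalent, arguably clearer idiom uses `DataFrame.mode`; a sketch on a hypothetical miniature frame (the column names below mirror the dataset but the values are made up):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "TestPrep": ["none", "none", np.nan, "completed"],
    "NrSiblings": [1.0, np.nan, 2.0, 1.0],
})
# mode().iloc[0] yields the most frequent value per column;
# fillna then broadcasts that value into each column's NaNs
filled = demo.fillna(demo.mode().iloc[0])
print(filled)
```

One advantage of this form is that `value_counts().index[0]` fails on an all-NaN column, while `mode()` simply leaves such a column unfilled.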
# check for duplicated rows (note: duplicated() operates row-wise;
# the earlier "column" check recomputed the same row statistic)
duplicate_sum = df.duplicated().sum()
print("Total Duplicated Rows", duplicate_sum)
Total Duplicated Rows 1
To see the overall distribution and outliers of our data we created histograms for the test score variables. Additionally, so that we can make efficient use of our plots, we have overlaid the median score on the histograms so that we can see where that lies within the distribution for each of the test scores.
We chose the median because the data are skewed; the median is therefore the more appropriate measure of central tendency.
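A small illustration of why the median is preferred under skew (the scores below are hypothetical, not drawn from the dataset):

```python
import pandas as pd

# A left-skewed set of scores: a few very low values drag the mean down,
# while the median stays near the bulk of the distribution.
scores = pd.Series([0, 15, 60, 65, 67, 68, 70, 72, 75, 78])
print(scores.mean())    # pulled down by the low outliers
print(scores.median())  # robust to them
print(scores.skew())    # negative => left-skewed
```

Here the mean sits well below the median, which is exactly the pattern a long left tail produces.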
math_m = df["MathScore"].median()
write_m = df["WritingScore"].median()
read_m = df["ReadingScore"].median()
#math_m, write_m, read_m
m_df = pd.DataFrame({"Math Score Median":[math_m],
"Writing Score Median": [write_m],
"Reading Score Median":[read_m]})
m_df
|   | Math Score Median | Writing Score Median | Reading Score Median |
|---|---|---|---|
| 0 | 67.0 | 69.0 | 70.0 |
# plot histogram and median
fig, axes = plt.subplots(nrows=2, ncols=2,figsize=(12, 12))
a = sns.histplot(x = "ReadingScore",data=df,ax=axes[0,0], bins = 20)
a.axvline(read_m, color = 'r', linestyle = 'dashed', linewidth = 2)
b = sns.histplot(x = "WritingScore",data=df,ax=axes[0,1],bins = 20)
b.axvline(write_m, color = 'r', linestyle = 'dashed', linewidth = 2)
c = sns.histplot(x = "MathScore",data=df,ax=axes[1,0],bins = 20)
c.axvline(math_m, color = 'r', linestyle = 'dashed', linewidth = 2)
axes[0, 0].set_title("Histogram of Reading Score with Median")
axes[0, 1].set_title("Histogram of Writing Score with Median")
axes[1, 0].set_title("Histogram of Math Score with Median")
fig.delaxes(axes[1][1]);
While there are a number of outliers, one notable example being a 0 in MathScore, we decided it would not make sense to remove any of them: every value in these score columns represents a real student outcome and is therefore important. We will not remove any data points.
def find_IQR(df,x):
'''
INPUT
df - dataframe to be used
x - interested column
OUTPUT
Q1 - First Quartile
Q3 - Third Quartile
IQR - Interquartile Range of interested column
'''
# IQR
Q1 = np.percentile(df[x], 25)
Q3 = np.percentile(df[x], 75)
IQR = Q3 - Q1
return Q1, Q3, IQR
# Math IQR
Q1, Q3, IQR = find_IQR(df, "MathScore")
# Upper bound
math_upper = Q3 + 1.5 * IQR
mathupper_array = df[df["MathScore"] >= math_upper]
# Lower bound
math_lower = Q1 - 1.5 * IQR
mathlower_array = df[df["MathScore"] <= math_lower]
mathlower_array
| Gender | EthnicGroup | ParentEduc | LunchType | TestPrep | ParentMaritalStatus | PracticeSport | IsFirstChild | NrSiblings | TransportMeans | WklyStudyHours | MathScore | ReadingScore | WritingScore | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | female | group B | some high school | free/reduced | none | divorced | sometimes | yes | 1.0 | school_bus | 5 - 10 | 18 | 32 | 28 |
| 55 | female | group C | some high school | free/reduced | none | single | regularly | yes | 3.0 | school_bus | > 10 | 0 | 17 | 10 |
| 136 | female | group C | some college | free/reduced | none | single | regularly | yes | 1.0 | school_bus | < 5 | 21 | 38 | 33 |
| 319 | female | group B | some high school | free/reduced | none | single | sometimes | no | 3.0 | school_bus | < 5 | 23 | 37 | 26 |
| 751 | female | group B | some college | standard | none | single | regularly | no | 4.0 | private | < 5 | 18 | 37 | 32 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 28544 | female | group C | some college | free/reduced | none | single | sometimes | yes | 3.0 | private | 5 - 10 | 11 | 39 | 34 |
| 28597 | female | group D | high school | free/reduced | completed | married | sometimes | yes | 2.0 | school_bus | > 10 | 9 | 32 | 32 |
| 29000 | male | group A | some college | free/reduced | none | married | sometimes | no | 1.0 | school_bus | < 5 | 21 | 18 | 21 |
| 29654 | female | group B | associate's degree | free/reduced | none | single | sometimes | no | 4.0 | private | 5 - 10 | 19 | 41 | 34 |
| 29947 | female | group B | some college | free/reduced | none | married | regularly | no | 1.0 | private | 5 - 10 | 23 | 44 | 38 |
93 rows × 14 columns
# Reading IQR
Q1, Q3, IQR = find_IQR(df, "ReadingScore")
# Upper bound
reading_upper = Q3 + 1.5 * IQR
readingupper_array = df[df["ReadingScore"] >= reading_upper]
# Lower bound
reading_lower = Q1 - 1.5 * IQR
readinglower_array = df[df["ReadingScore"] <= reading_lower]
readinglower_array
| Gender | EthnicGroup | ParentEduc | LunchType | TestPrep | ParentMaritalStatus | PracticeSport | IsFirstChild | NrSiblings | TransportMeans | WklyStudyHours | MathScore | ReadingScore | WritingScore | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 55 | female | group C | some high school | free/reduced | none | single | regularly | yes | 3.0 | school_bus | > 10 | 0 | 17 | 10 |
| 69 | male | group E | some high school | standard | none | married | sometimes | no | 5.0 | school_bus | 5 - 10 | 28 | 25 | 21 |
| 308 | male | group A | some college | free/reduced | none | married | regularly | yes | 2.0 | school_bus | > 10 | 29 | 23 | 19 |
| 565 | male | group B | high school | free/reduced | none | married | regularly | no | 2.0 | school_bus | 5 - 10 | 30 | 24 | 15 |
| 928 | female | group B | high school | free/reduced | none | married | sometimes | yes | 1.0 | private | 5 - 10 | 8 | 24 | 23 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 28698 | male | group C | associate's degree | free/reduced | none | married | sometimes | no | 1.0 | school_bus | 5 - 10 | 27 | 26 | 23 |
| 29000 | male | group A | some college | free/reduced | none | married | sometimes | no | 1.0 | school_bus | < 5 | 21 | 18 | 21 |
| 30389 | male | group B | associate's degree | free/reduced | none | divorced | never | no | 1.0 | school_bus | 5 - 10 | 24 | 14 | 4 |
| 30412 | male | group C | some high school | free/reduced | none | single | regularly | no | 1.0 | private | 5 - 10 | 33 | 25 | 29 |
| 30478 | male | group D | associate's degree | standard | none | married | sometimes | no | 1.0 | school_bus | < 5 | 32 | 19 | 20 |
90 rows × 14 columns
# Writing IQR
Q1, Q3, IQR = find_IQR(df, "WritingScore")
# Upper bound
write_upper = Q3 + 1.5 * IQR
writeupper_array = df[df["WritingScore"] >= write_upper]
# Lower bound
write_lower = Q1 - 1.5 * IQR
writelower_array = df[df["WritingScore"] <= write_lower]
writelower_array
| Gender | EthnicGroup | ParentEduc | LunchType | TestPrep | ParentMaritalStatus | PracticeSport | IsFirstChild | NrSiblings | TransportMeans | WklyStudyHours | MathScore | ReadingScore | WritingScore | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 55 | female | group C | some high school | free/reduced | none | single | regularly | yes | 3.0 | school_bus | > 10 | 0 | 17 | 10 |
| 69 | male | group E | some high school | standard | none | married | sometimes | no | 5.0 | school_bus | 5 - 10 | 28 | 25 | 21 |
| 308 | male | group A | some college | free/reduced | none | married | regularly | yes | 2.0 | school_bus | > 10 | 29 | 23 | 19 |
| 319 | female | group B | some high school | free/reduced | none | single | sometimes | no | 3.0 | school_bus | < 5 | 23 | 37 | 26 |
| 565 | male | group B | high school | free/reduced | none | married | regularly | no | 2.0 | school_bus | 5 - 10 | 30 | 24 | 15 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29274 | female | group E | some college | free/reduced | none | married | sometimes | no | 1.0 | school_bus | 5 - 10 | 30 | 32 | 26 |
| 29817 | male | group D | high school | free/reduced | none | married | sometimes | no | 1.0 | private | 5 - 10 | 33 | 31 | 22 |
| 30389 | male | group B | associate's degree | free/reduced | none | divorced | never | no | 1.0 | school_bus | 5 - 10 | 24 | 14 | 4 |
| 30478 | male | group D | associate's degree | standard | none | married | sometimes | no | 1.0 | school_bus | < 5 | 32 | 19 | 20 |
| 30630 | male | group B | associate's degree | free/reduced | none | married | sometimes | no | 4.0 | private | 5 - 10 | 43 | 28 | 24 |
109 rows × 14 columns
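The three per-subject blocks above repeat the same bound computation; a single helper could consolidate them. A sketch on hypothetical scores (`iqr_outliers` is our own name, and it uses the same inclusive `>=`/`<=` comparisons as above):

```python
import numpy as np
import pandas as pd

def iqr_outliers(df, col, k=1.5):
    """Return rows of df whose col value lies at or beyond Q1 - k*IQR / Q3 + k*IQR."""
    q1, q3 = np.percentile(df[col], [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[col] <= lower) | (df[col] >= upper)]

# Hypothetical scores with one extreme low value
demo = pd.DataFrame({"MathScore": [0, 55, 60, 62, 65, 67, 70, 72, 75, 78]})
print(iqr_outliers(demo, "MathScore"))
```

The same function then covers math, reading, and writing by varying only the column name.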
# Create a DataFrame of the scores
scores_df = df[["ReadingScore", "MathScore", "WritingScore"]]
# Melt the DataFrame to "long-form" format
scores_melt = pd.melt(scores_df)
# Create the boxplot using Seaborn
sns.boxplot(x="variable", y="value", data=scores_melt)
# Add labels to the x-axis ticks
plt.xticks([0, 1, 2], ["Reading Scores", "Math Scores", "Writing Scores"])
# Add labels to the y-axis
plt.ylabel("Score")
# Add a title
plt.title("Boxplot of Reading, Math, and Writing Scores")
# Display the plot
plt.show()
Analyzing summary statistics, handling outliers, missing values, and duplicated data are essential steps in the data preprocessing phase of conducting EDA. Additionally, this step ensures we are able to have reliable and accurate data to work with and use to conduct experiments where we have a full understanding of the interpretability and application that our data allows.
df.describe().transpose()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| NrSiblings | 30641.0 | 2.087106 | 1.442665 | 0.0 | 1.0 | 2.0 | 3.0 | 7.0 |
| MathScore | 30641.0 | 66.558402 | 15.361616 | 0.0 | 56.0 | 67.0 | 78.0 | 100.0 |
| ReadingScore | 30641.0 | 69.377533 | 14.758952 | 10.0 | 59.0 | 70.0 | 80.0 | 100.0 |
| WritingScore | 30641.0 | 68.418622 | 15.443525 | 4.0 | 58.0 | 69.0 | 79.0 | 100.0 |
# Calculate the statistics using describe()
reading_stats = round(df["ReadingScore"].describe(),2)
math_stats = round(df["MathScore"].describe(),2)
writing_stats = round(df["WritingScore"].describe(),2)
# Calculate the mode separately
reading_mode = df["ReadingScore"].mode()
math_mode = df["MathScore"].mode()
writing_mode = df["WritingScore"].mode()
# Calculate the Median Separately
reading_median = df["ReadingScore"].median()
math_median = df["MathScore"].median()
writing_median = df["WritingScore"].median()
# Calculate the Range Separately
reading_range = (df["ReadingScore"].max()-df["ReadingScore"].min())
math_range = (df["MathScore"].max()-df["MathScore"].min())
writing_range = (df["WritingScore"].max()-df["WritingScore"].min())
# Add the mode to the statistics DataFrame
reading_stats["Mode"] = reading_mode[0]
math_stats["Mode"] = math_mode[0]
writing_stats["Mode"] = writing_mode[0]
# Add the Median to the statistics DataFrame
reading_stats["Median"] = reading_median
math_stats["Median"] = math_median
writing_stats["Median"] = writing_median
# Add the Range to the statistics DataFrame
reading_stats["Range"] = reading_range
math_stats["Range"] = math_range
writing_stats["Range"] = writing_range
# Combine the statistics into a single DataFrame
stats_df = pd.DataFrame({"Reading Score": reading_stats,
"Math Score": math_stats,
"Writing Score": writing_stats})
# Convert the DataFrame to a visual table
table = tabulate(stats_df, headers='keys', tablefmt='fancy_grid')
# Print the table
print(table)
|        | Reading Score | Math Score | Writing Score |
|---|---|---|---|
| count  | 30641 | 30641 | 30641 |
| mean   | 69.38 | 66.56 | 68.42 |
| std    | 14.76 | 15.36 | 15.44 |
| min    | 10 | 0 | 4 |
| 25%    | 59 | 56 | 58 |
| 50%    | 70 | 67 | 69 |
| 75%    | 80 | 78 | 79 |
| max    | 100 | 100 | 100 |
| Mode   | 65 | 64 | 67 |
| Median | 70 | 67 | 69 |
| Range  | 90 | 100 | 96 |
Range: The range of Reading Score, Math Score, and Writing Score can provide an idea of how spread out the scores are and the variability of the performance of students in these subjects.
Mode: The mode of Reading Score, Math Score, and Writing Score can provide an idea of the most common score achieved by students in these subjects, which can be useful for identifying areas of strength and weakness.
Mean: The mean of Reading Score, Math Score, and Writing Score can provide an idea of the average performance of students in these subjects, which can be useful for comparing the performance of different groups of students.
Median: The median of Reading Score, Math Score, and Writing Score can provide an idea of the central tendency of the scores, which can be useful for identifying outliers and extreme values that may skew the mean.
Standard Deviation: The standard deviation of Reading Score, Math Score, and Writing Score can provide an idea of the spread of the scores around the mean, which can be useful for identifying the variability of the performance of students in these subjects.
Min: The minimum score of Reading Score, Math Score, and Writing Score can provide an idea of the lowest score achieved by students in these subjects.
25%, 50%, and 75%: The quartiles of Reading Score, Math Score, and Writing Score can provide an idea of the spread and distribution of the scores, which can be useful for identifying the performance of students in different percentile ranges.
Max: The maximum score of Reading Score, Math Score, and Writing Score can provide an idea of the highest score achieved by students in these subjects.
Count: The count of Reading Score, Math Score, and Writing Score can provide an idea of the sample size of students who took the test in each subject, which can be useful for making inferences about the population of students.
#Number of Missing Values in Each Column
df.isnull().sum()
# work on an independent copy so later changes do not mutate df
df3 = df.copy()
df3.describe()
df3.head()
| Gender | EthnicGroup | ParentEduc | LunchType | TestPrep | ParentMaritalStatus | PracticeSport | IsFirstChild | NrSiblings | TransportMeans | WklyStudyHours | MathScore | ReadingScore | WritingScore | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | female | group C | bachelor's degree | standard | none | married | regularly | yes | 3.0 | school_bus | < 5 | 71 | 71 | 74 |
| 1 | female | group C | some college | standard | none | married | sometimes | yes | 0.0 | school_bus | 5 - 10 | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | single | sometimes | yes | 4.0 | school_bus | < 5 | 87 | 93 | 91 |
| 3 | male | group A | associate's degree | free/reduced | none | married | never | no | 1.0 | school_bus | 5 - 10 | 45 | 56 | 42 |
| 4 | male | group C | some college | standard | none | married | sometimes | yes | 0.0 | school_bus | 5 - 10 | 76 | 78 | 75 |
Visual Interpretation
Why the Visual is Appropriate:
# Pairwise score comparisons colored by Gender
sns.pairplot(df3[["WritingScore", "ReadingScore", "MathScore", "Gender"]], hue="Gender")
plt.show()
## boxplot
def plot_imputed(col):
'''
INPUT:
col - choose interested column for the x-axis; write with quotations
'''
fig, axes = plt.subplots(nrows = 2, ncols = 2, figsize = (20,15))
sns.boxplot(ax=axes[0,0], data = df3, x = col, y = "MathScore")
sns.boxplot(ax = axes[0,1], data = df3, x = col, y = "ReadingScore")
sns.boxplot(ax=axes[1,0], data = df3, x = col, y = "WritingScore")
sns.countplot(ax=axes[1,1], x=col, data=df3)
axes[0,0].set_title(f"Math Scores by {col}")
axes[0,1].set_title(f"Reading Scores by {col}")
axes[1,0].set_title(f"Writing Scores by {col}")
axes[1,1].set_title(f"{col} Count")
plt.tight_layout()
plt.show()
## multivariate plot
def plot_imputed_2var(col, col2):
'''
INPUT:
col - choose interested column for the x-axis; write with quotations
col2 - choose interested column for the hue; write with quotations
'''
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(20, 15))
sns.boxplot(ax=axes[0, 0],x=col, y="MathScore",data=df3,hue=col2)
axes[0, 0].set_title(f"MathScore for {col} Grouped by {col2}")
sns.boxplot(ax=axes[0, 1],x=col, y="ReadingScore",data=df3,hue=col2)
axes[0, 1].set_title(f"ReadingScore for {col} Grouped by {col2}")
sns.boxplot(ax=axes[1, 0],x=col, y="WritingScore",data=df3,hue=col2)
axes[1, 0].set_title(f"WritingScore for {col} Grouped by {col2}")
# customize x-axis tick labels for each subplot
axes[0, 0].set_xticklabels(axes[0, 0].get_xticklabels(), rotation=70)
axes[0, 1].set_xticklabels(axes[0, 1].get_xticklabels(), rotation=70)
axes[1, 0].set_xticklabels(axes[1, 0].get_xticklabels(), rotation=70)
fig.delaxes(axes[1][1])
plt.tight_layout()
plt.show()
Visual Interpretation
Why the Visual is Appropriate:
df3["ParentEduc"] = df3["ParentEduc"].astype('category')
df3["ParentEduc"] = df3["ParentEduc"].cat.reorder_categories(["some high school","high school",
"some college", "associate's degree", "bachelor's degree", "master's degree"])
plot_imputed("ParentEduc")
Visual Interpretation
Why the Visual is Appropriate:
df3['EthnicGroup'] = df3['EthnicGroup'].astype('category')
df3["EthnicGroup"] = df3["EthnicGroup"].cat.reorder_categories(["group A", "group B", "group C", "group D", "group E"])
plot_imputed("EthnicGroup")
Visual Interpretation
Why the Visual is Appropriate:
plot_imputed("TestPrep")
Visual Interpretation
plot_imputed("LunchType")
After analyzing the data in the Gender | Top Feat. section above, we observed a visually significant pattern between test scores and gender: females tend to achieve higher scores in reading and writing, while males tend to excel in math. To further visualize this comparative relationship, we provide box plots and an additional count plot, which suggests a relatively equal distribution of males and females in our dataset.
plot_imputed("Gender")
Cross Tabulation Defined: a cross tabulation (contingency table) counts observations at each combination of two categorical variables, which lets us estimate conditional probabilities such as P(Pass | ParentEduc).
# derive pass/fail results using a 70-point passing threshold
def pass_fail(score):
if score >= 70:
return "Pass"
else:
return "Fail"
df["MathResult"] = df["MathScore"].apply(pass_fail)
df["ReadingResult"] = df["ReadingScore"].apply(pass_fail)
df["WritingResult"] = df["WritingScore"].apply(pass_fail)
cross_tab1 = pd.crosstab(df['ParentEduc'], df['MathResult'], dropna=False)
print("\n",cross_tab1,"\n")
# Calculate and print the conditional probabilities
total_by_parent_educ = cross_tab1.sum(axis=1)
conditional_probs = cross_tab1.div(total_by_parent_educ, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")
cross_tab2 = pd.crosstab(df['LunchType'], df['MathResult'], dropna=False)
print("\n",cross_tab2,"\n")
# Calculate and print the conditional probabilities
total_by_lunch_type = cross_tab2.sum(axis=1)
conditional_probs = cross_tab2.div(total_by_lunch_type, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")
cross_tab3 = pd.crosstab(df['EthnicGroup'], df['MathResult'], dropna=False)
print("\n",cross_tab3,"\n")
# Calculate and print the conditional probabilities
total_by_ethnic_group = cross_tab3.sum(axis=1)
conditional_probs = cross_tab3.div(total_by_ethnic_group, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")
cross_tab4 = pd.crosstab(df['TestPrep'], df['MathResult'], dropna=False)
print("\n",cross_tab4,"\n")
# Calculate and print the conditional probabilities
total_by_test_prep = cross_tab4.sum(axis=1)
conditional_probs = cross_tab4.div(total_by_test_prep, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")
MathResult           Fail  Pass
ParentEduc
some high school     3662  1855
high school          3528  2159
some college         4794  3684
associate's degree   2896  2654
bachelor's degree    1566  1820
master's degree       814  1209

Conditional Probabilities:
MathResult              Fail      Pass
ParentEduc
some high school    0.663767  0.336233
high school         0.620362  0.379638
some college        0.565464  0.434536
associate's degree  0.521802  0.478198
bachelor's degree   0.462493  0.537507
master's degree     0.402373  0.597627

MathResult    Fail   Pass
LunchType
free/reduced  8147   2589
standard      9113  10792

Conditional Probabilities:
MathResult        Fail      Pass
LunchType
free/reduced  0.758849  0.241151
standard      0.457825  0.542175

MathResult   Fail  Pass
EthnicGroup
group A      1465   754
group B      3751  2075
group C      6673  4379
group D      4010  3493
group E      1361  2680

Conditional Probabilities:
MathResult       Fail      Pass
EthnicGroup
group A      0.660207  0.339793
group B      0.643838  0.356162
group C      0.603782  0.396218
group D      0.534453  0.465547
group E      0.336798  0.663202

MathResult   Fail  Pass
TestPrep
completed    4848  5107
none        12412  8274

Conditional Probabilities:
MathResult      Fail      Pass
TestPrep
completed   0.486991  0.513009
none        0.600019  0.399981
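The four near-identical blocks above (and the reading and writing versions below) can be collapsed into one helper that returns counts alongside row-normalized conditional probabilities. A sketch on a hypothetical miniature frame (`crosstab_with_probs` is our own name):

```python
import pandas as pd

def crosstab_with_probs(df, feature, result_col):
    """Cross-tabulate feature vs. result_col; return counts and
    row-wise conditional probabilities P(result | feature)."""
    counts = pd.crosstab(df[feature], df[result_col], dropna=False)
    probs = counts.div(counts.sum(axis=1), axis=0)
    return counts, probs

# Hypothetical miniature data
demo = pd.DataFrame({
    "TestPrep": ["completed", "none", "none", "completed"],
    "MathResult": ["Pass", "Fail", "Pass", "Pass"],
})
counts, probs = crosstab_with_probs(demo, "TestPrep", "MathResult")
print(counts)
print(probs)
```

Looping this helper over the feature columns of interest would reproduce each printed table with a single call per feature.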
# plot all Math Cross-tabulations
fig, axes = plt.subplots(nrows=2,ncols=2, figsize=(12,10),dpi=200)
cross_tab_plots = [
(cross_tab1, "Parent Educ"),
(cross_tab2, "Lunch Type"),
(cross_tab3, "EthnicGroup"),
(cross_tab4, "Test Prep")
]
for i, (cross_tab, label) in enumerate(cross_tab_plots):
ax = axes[i//2, i%2]
cross_tab.plot(kind='bar',stacked = False, ax =ax)
ax.set_ylabel("Count")
ax.set_title(f"Cross Tabulation: {label} Vs Math Result")
ax.legend(title = "Test Score Result")
plt.tight_layout()
plt.show()
cross_tab1a = pd.crosstab(df['ParentEduc'], df['ReadingResult'], dropna=False)
print("\n",cross_tab1a,"\n")
# Calculate and print the conditional probabilities
total_by_parent_educ = cross_tab1a.sum(axis=1)
conditional_probs = cross_tab1a.div(total_by_parent_educ, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")
cross_tab2a = pd.crosstab(df['LunchType'], df['ReadingResult'], dropna=False)
print("\n",cross_tab2a,"\n")
# Calculate and print the conditional probabilities
total_by_lunch_type = cross_tab2a.sum(axis=1)
conditional_probs = cross_tab2a.div(total_by_lunch_type, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")
cross_tab3a = pd.crosstab(df['EthnicGroup'], df['ReadingResult'], dropna=False)
print("\n",cross_tab3a,"\n")
# Calculate and print the conditional probabilities
total_by_ethnic_group = cross_tab3a.sum(axis=1)
conditional_probs = cross_tab3a.div(total_by_ethnic_group, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")
cross_tab4a = pd.crosstab(df['TestPrep'], df['ReadingResult'], dropna=False)
print("\n",cross_tab4a,"\n")
# Calculate and print the conditional probabilities
total_by_test_prep = cross_tab4a.sum(axis=1)
conditional_probs = cross_tab4a.div(total_by_test_prep, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")
ReadingResult        Fail   Pass
ParentEduc
some high school     3308   2209
high school          3127   2560
some college         4267   4211
associate's degree   2524   3026
bachelor's degree    1357   2029
master's degree       670   1353

Conditional Probabilities:
ReadingResult           Fail      Pass
ParentEduc
some high school    0.599601  0.400399
high school         0.549851  0.450149
some college        0.503303  0.496697
associate's degree  0.454775  0.545225
bachelor's degree   0.400768  0.599232
master's degree     0.331191  0.668809

ReadingResult  Fail   Pass
LunchType
free/reduced   6842   3894
standard       8411  11494

Conditional Probabilities:
ReadingResult      Fail      Pass
LunchType
free/reduced   0.637295  0.362705
standard       0.422557  0.577443

ReadingResult  Fail  Pass
EthnicGroup
group A        1303   916
group B        3197  2629
group C        5706  5346
group D        3566  3937
group E        1481  2560

Conditional Probabilities:
ReadingResult      Fail      Pass
EthnicGroup
group A        0.587201  0.412799
group B        0.548747  0.451253
group C        0.516287  0.483713
group D        0.475277  0.524723
group E        0.366493  0.633507

ReadingResult   Fail  Pass
TestPrep
completed       3802  6153
none           11451  9235

Conditional Probabilities:
ReadingResult      Fail      Pass
TestPrep
completed      0.381919  0.618081
none           0.553563  0.446437
# plot all Reading Cross-Tabulations
fig, axes = plt.subplots(nrows=2,ncols=2, figsize=(12,10),dpi=200)
cross_tab_plots = [
    (cross_tab1a, "Parent Educ"),
    (cross_tab2a, "Lunch Type"),
    (cross_tab3a, "EthnicGroup"),
    (cross_tab4a, "Test Prep")
]
for i, (cross_tab, label) in enumerate(cross_tab_plots):
    ax = axes[i//2, i%2]
    cross_tab.plot(kind='bar', stacked=False, ax=ax)
    ax.set_ylabel("Count")
    ax.set_title(f"Cross Tabulation: {label} Vs Reading Result")
    ax.legend(title="Test Score Result")
plt.tight_layout()
plt.show()
cross_tab1b = pd.crosstab(df['ParentEduc'], df['WritingResult'], dropna=False)
print("\n",cross_tab1b,"\n")
# Calculate and print the conditional probabilities
row_totals = cross_tab1b.sum(axis=1)
conditional_probs = cross_tab1b.div(row_totals, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")
cross_tab2b = pd.crosstab(df['LunchType'], df['WritingResult'], dropna=False)
print("\n",cross_tab2b,"\n")
# Calculate and print the conditional probabilities
row_totals = cross_tab2b.sum(axis=1)
conditional_probs = cross_tab2b.div(row_totals, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")
cross_tab3b = pd.crosstab(df['EthnicGroup'], df['WritingResult'], dropna=False)
print("\n",cross_tab3b,"\n")
# Calculate and print the conditional probabilities
row_totals = cross_tab3b.sum(axis=1)
conditional_probs = cross_tab3b.div(row_totals, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")
cross_tab4b = pd.crosstab(df['TestPrep'], df['WritingResult'], dropna=False)
print("\n",cross_tab4b,"\n")
# Calculate and print the conditional probabilities
row_totals = cross_tab4b.sum(axis=1)
conditional_probs = cross_tab4b.div(row_totals, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")
WritingResult         Fail  Pass
ParentEduc
some high school      3542  1975
high school           3417  2270
some college          4414  4064
associate's degree    2612  2938
bachelor's degree     1366  2020
master's degree        635  1388

Conditional Probabilities:
WritingResult             Fail      Pass
ParentEduc
some high school      0.642016  0.357984
high school           0.600844  0.399156
some college          0.520642  0.479358
associate's degree    0.470631  0.529369
bachelor's degree     0.403426  0.596574
master's degree       0.313890  0.686110

WritingResult   Fail   Pass
LunchType
free/reduced    7189   3547
standard        8797  11108

Conditional Probabilities:
WritingResult       Fail      Pass
LunchType
free/reduced    0.669616  0.330384
standard        0.441949  0.558051

WritingResult  Fail  Pass
EthnicGroup
group A        1351   868
group B        3391  2435
group C        6094  4958
group D        3486  4017
group E        1664  2377

Conditional Probabilities:
WritingResult      Fail      Pass
EthnicGroup
group A        0.608833  0.391167
group B        0.582046  0.417954
group C        0.551393  0.448607
group D        0.464614  0.535386
group E        0.411779  0.588221

WritingResult   Fail  Pass
TestPrep
completed       3548  6407
none           12438  8248

Conditional Probabilities:
WritingResult      Fail      Pass
TestPrep
completed      0.356404  0.643596
none           0.601276  0.398724
# plot all Writing Cross-Tabulations
fig, axes = plt.subplots(nrows=2,ncols=2, figsize=(12,10),dpi=200)
cross_tab_plots = [
    (cross_tab1b, "Parent Educ"),
    (cross_tab2b, "Lunch Type"),
    (cross_tab3b, "EthnicGroup"),
    (cross_tab4b, "Test Prep")
]
for i, (cross_tab, label) in enumerate(cross_tab_plots):
    ax = axes[i//2, i%2]
    cross_tab.plot(kind='bar', stacked=False, ax=ax)
    ax.set_ylabel("Count")
    ax.set_title(f"Cross Tabulation: {label} Vs Writing Result")
    ax.legend(title="Test Score Result")
plt.tight_layout()
plt.show()
Correlation Heatmap Defined:
Correlation Heatmap Warning:
Correlation Heatmap Purpose:
df["EthnicGroup"] = df["EthnicGroup"].astype(object)
df["ParentEduc"] = df["ParentEduc"].astype(object)
# new dfs with/without object type
df_obj = df.select_dtypes(include = "object")
df_num = df.select_dtypes(exclude = "object")
# view levels of columns
for col in df_obj.columns:
    unique_values = df_obj[col].unique()
    # print(f"Unique values in column '{col}': {unique_values}")
# creating dummy var for the object type df.
# dropping first = True for multicollinearity dummy trap
df_obj = pd.get_dummies(df_obj, drop_first = True)
# combining into one df
cor_df = pd.concat([df_num, df_obj], axis = 1)
plt.figure(figsize=(12,10), dpi=300)
sns.heatmap(cor_df.corr(),cmap = "magma", linecolor='white', linewidths=.75)
plt.show()
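The `drop_first=True` choice above (to sidestep the dummy-variable trap) can be seen on a toy column; the data here is made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({"LunchType": ["standard", "free/reduced", "standard"]})

# Without drop_first, one dummy per level; with it, the first level
# (alphabetically, "free/reduced") becomes the implicit baseline.
full = pd.get_dummies(toy)
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))     # ['LunchType_free/reduced', 'LunchType_standard']
print(list(reduced.columns))  # ['LunchType_standard']
```

Dropping one level per variable removes the exact linear dependence among the dummies that would otherwise distort the correlation matrix.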
# raw data corr viz.
plt.figure(figsize=(12,8), dpi=200)
sns.heatmap(df.corr(numeric_only=True), cmap="magma", annot=True)
plt.show()
cor_df.corr()["WritingScore"].sort_values(ascending=False)
WritingScore                   1.000000
ReadingScore                   0.952584
WritingResult_Pass             0.808938
MathScore                      0.807118
ReadingResult_Pass             0.773939
MathResult_Pass                0.650314
LunchType_standard             0.274305
ParentEduc_master's degree     0.136667
ParentEduc_bachelor's degree   0.112119
EthnicGroup_group E            0.107477
EthnicGroup_group D            0.091160
PracticeSport_regularly        0.056602
WklyStudyHours_> 10            0.039798
IsFirstChild_yes               0.009822
ParentEduc_some college        0.001525
ParentMaritalStatus_widowed    0.001315
ParentMaritalStatus_married   -0.000964
NrSiblings                    -0.001584
TransportMeans_school_bus     -0.004522
ParentMaritalStatus_single    -0.008681
PracticeSport_sometimes       -0.022002
WklyStudyHours_< 5            -0.052162
EthnicGroup_group C           -0.061737
EthnicGroup_group B           -0.079176
ParentEduc_high school        -0.092659
ParentEduc_some high school   -0.145231
TestPrep_none                 -0.282308
Gender_male                   -0.289312
Name: WritingScore, dtype: float64
Ranking variables by their correlation with Writing Score, a few factors stand out.
The strongest relationships with Writing Score are the other two exam scores (Math and Reading). The pass/fail exam results also correlate strongly, which shouldn't come as a surprise since they are derived from the scores. Everything else can be classified as weak, with correlations between 0.274 and -0.289. Among the weak relationships, three stand out: Gender (male) at -0.289312, TestPrep (none) at -0.282308, and LunchType (standard) at 0.274305.
This affirms our findings in the previous section, where we created boxplots to explore the relationships between these variables and Writing Score.
cor_df.corr()["ReadingScore"].sort_values(ascending=False)
ReadingScore                   1.000000
WritingScore                   0.952584
MathScore                      0.817825
ReadingResult_Pass             0.812386
WritingResult_Pass             0.773737
MathResult_Pass                0.657498
LunchType_standard             0.258152
EthnicGroup_group E            0.128716
ParentEduc_master's degree     0.116293
ParentEduc_bachelor's degree   0.087993
EthnicGroup_group D            0.038766
WklyStudyHours_> 10            0.030269
PracticeSport_regularly        0.028254
IsFirstChild_yes               0.012117
ParentMaritalStatus_widowed    0.002603
ParentMaritalStatus_married    0.000164
NrSiblings                    -0.000412
TransportMeans_school_bus     -0.004933
ParentEduc_some college       -0.007873
ParentMaritalStatus_single    -0.008195
PracticeSport_sometimes       -0.008583
EthnicGroup_group C           -0.043759
WklyStudyHours_< 5            -0.049362
EthnicGroup_group B           -0.067535
ParentEduc_high school        -0.069982
ParentEduc_some high school   -0.122773
TestPrep_none                 -0.204724
Gender_male                   -0.237097
Name: ReadingScore, dtype: float64
The Reading Score tells the same story: the Math and Writing scores and the exam results have the strongest correlations.
The same weak variables recur and still deserve attention, even though their values differ slightly: Gender (male) at -0.237097, TestPrep (none) at -0.204724, and LunchType (standard) at 0.258152.
cor_df.corr()["MathScore"].sort_values(ascending=False)
MathScore                      1.000000
ReadingScore                   0.817825
WritingScore                   0.807118
MathResult_Pass                0.802676
ReadingResult_Pass             0.665192
WritingResult_Pass             0.653592
LunchType_standard             0.367942
EthnicGroup_group E            0.221775
Gender_male                    0.162391
ParentEduc_master's degree     0.100001
ParentEduc_bachelor's degree   0.089675
WklyStudyHours_> 10            0.062945
PracticeSport_regularly        0.061482
EthnicGroup_group D            0.041074
IsFirstChild_yes               0.014809
ParentMaritalStatus_widowed    0.007399
ParentMaritalStatus_married    0.007144
TransportMeans_school_bus      0.002351
NrSiblings                    -0.000538
ParentEduc_some college       -0.004527
ParentMaritalStatus_single    -0.014035
PracticeSport_sometimes       -0.018130
ParentEduc_high school        -0.065967
WklyStudyHours_< 5            -0.078084
EthnicGroup_group C           -0.078949
EthnicGroup_group B           -0.096779
ParentEduc_some high school   -0.121240
TestPrep_none                 -0.134949
Name: MathScore, dtype: float64
The Math Score correlations differ slightly from the previous two (Writing and Reading). As before, the other two exam scores and the exam results have the strongest correlations with the variable of interest.
Among the weak relationships, the ones worth attention are LunchType (standard) at 0.367942, EthnicGroup (E) at 0.221775, and Gender (male) at 0.162391; as an even weaker link, we can also explore TestPrep (none) at -0.134949.
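The strong/weak grouping used in these discussions can be made explicit with a threshold filter. A sketch on made-up correlation values (the 0.2 cut-off is our own illustrative choice, not something fixed by the dataset):

```python
import pandas as pd

# Hypothetical correlations with MathScore
corrs = pd.Series({
    "ReadingScore": 0.82,
    "LunchType_standard": 0.37,
    "Gender_male": 0.16,
    "NrSiblings": -0.0005,
    "TestPrep_none": -0.13,
})

threshold = 0.2  # arbitrary cut-off for "worth a closer look"
notable = corrs[corrs.abs() >= threshold].sort_values(ascending=False)
print(notable)
```

Filtering on the absolute value keeps strong negative correlations (like TestPrep_none in the real output) from being silently dropped.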
Now we want to look at the scores when ParentEduc is paired with another variable, so we can understand our data better. Here we see how the scores fare when we pair education level with Gender.
For the Writing and Reading scores, females score higher than males at every education level, while males score higher in Math across all education levels.
df3["ParentEduc"] = df3["ParentEduc"].astype('category')
df3["ParentEduc"] = df3["ParentEduc"].cat.reorder_categories(["some high school", "high school",
                                                              "some college", "associate's degree",
                                                              "bachelor's degree", "master's degree"])
plot_imputed_2var("ParentEduc", "Gender")
Similar to the previous visual analysis, we add another variable to see how the scores fare, this time pairing Ethnic Group with Test Prep.
Across the board, those who completed test prep score higher than those who did not, which shouldn't come as a surprise. Group E scores highest overall, except on the Writing Score, where Groups E and D are neck-and-neck with roughly the same score.
df3['EthnicGroup'] = df3['EthnicGroup'].astype('category')
# reorder_categories returns a new Series, so assign the result back
df3["EthnicGroup"] = df3["EthnicGroup"].cat.reorder_categories(["group A", "group B", "group C", "group D", "group E"])
plot_imputed_2var("EthnicGroup", "TestPrep")
Taking Test Prep and the scores and adding Gender, we create a multivariate plot with all three variables.
As above, those who completed test prep have higher scores. Females have a higher median in Reading and Writing, while males take the prize in Math.
plot_imputed_2var("TestPrep", "Gender")
Visual Interpretation
Why the Visual is Appropriate:
df3["ParentMaritalStatus"] = df3["ParentMaritalStatus"].astype('category')
df3["ParentMaritalStatus"] = df3["ParentMaritalStatus"].cat.reorder_categories(["single", "married",
                                                                                "divorced", "widowed"])
plot_imputed("ParentMaritalStatus")
Once again, using ParentMaritalStatus from above and adding a new variable (LunchType), we create a multivariate plot to see how the scores behave.
This is interesting because marital status by itself showed little difference in scores, but once we add Lunch Type, we find that those with free/reduced lunches score significantly lower than their counterparts.
plot_imputed_2var("ParentMaritalStatus", "LunchType")
Looking at the boxplots, we find that those who practice sports regularly achieve higher test scores across the board, and the less they practice, the lower their scores. One interesting aspect to note is that the number of students who claim to never practice sports is less than half of those who practice regularly.
plot_imputed("PracticeSport")
Adding WklyStudyHours to our plot doesn't reveal much that is new. Those who study more than 10 hours have a higher median score whether or not they practice sports, but the difference is minuscule for Reading and Writing. It is more notable in Math, where the gap between the WklyStudyHours categories is more pronounced.
plot_imputed_2var("WklyStudyHours", "PracticeSport")
The boxplots confirm what we expected through intuition: the more an individual studies, the higher the score they achieve, across the board. However, we should note that the count of those who study 5-10 hours is much higher than the other two groups.
plot_imputed("WklyStudyHours")
This time we create a multivariable plot with our previous two variables and EthnicGroup.
As expected, regardless of Ethnic Group, those who logged more than 10 hours tend to score higher than those who studied less. Group E has the highest median score across all the plots when comparing the respective hour ranges of each group. Interestingly, in Group A the gap between those who logged more than 10 hours and those who logged 5-10 hours is larger than the corresponding gap in any other group.
plot_imputed_2var("EthnicGroup", "WklyStudyHours")
Features that could be added to the data:
1) School (or geographic location/district)
2) Teacher (or class)
3) Extracurricular Activities (other than sport specifically)
4) Parent Household Income
5) Work with Tutor
6) Teacher offers tutoring
7) Age
8) Participant in "student government"
9) Parent participation in PTA
10) Class or school average GPA
11) Student GPA
Features that can be created from existing features:
1) Pass/Fail for Math/Reading/Writing can be created by creating levels of score data (>=70 : pass, <70 : Fail)
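The pass/fail feature described above can be derived in one vectorized step. A minimal sketch with made-up scores (the real construction appears later via an `apply`-based helper):

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({"MathScore": [55, 70, 92, 69]})

# >= 70 is the pass threshold used throughout this analysis
scores["MathResult"] = np.where(scores["MathScore"] >= 70, "Pass", "Fail")
print(scores)
```

`np.where` evaluates the whole column at once, which scales better than a row-wise `apply` on large frames.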
Pass | Fail Variable
1) Classification vs. Regression: With a pass/fail variable, the prediction task typically shifts from regression (predicting a continuous variable) to classification (predicting a discrete class label). This change in problem formulation affects the choice of algorithms and evaluation metrics. Classification algorithms such as logistic regression, decision trees, random forests, or support vector machines may be more suitable for predicting pass/fail outcomes.
2) Interpretability: Predicting a pass/fail outcome can be more interpretable and actionable than predicting a continuous variable. This new attribute allows us to gain insights into the factors contributing to pass/fail predictions, identifying which attributes or features have the most significant influence. This interpretability can aid decision-making, intervention strategies, and understanding the factors driving student success.
3) Model evaluation: The evaluation of pass/fail predictions often requires different metrics than those used for continuous variables. Accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-ROC) are common metrics used to assess the performance of classification models.
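The metrics listed above can all be computed from a confusion matrix. A sketch with hypothetical counts (in practice one would typically use `sklearn.metrics`; the numbers below are invented):

```python
# Hypothetical confusion-matrix counts for a Pass/Fail classifier:
# tp = predicted Pass & actually Pass, fp = predicted Pass & actually Fail,
# fn = predicted Fail & actually Pass, tn = predicted Fail & actually Fail
tp, fp, fn, tn = 80, 10, 20, 90

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Precision and recall pull in different directions (few false alarms vs. few missed passes), which is why the F1 score, their harmonic mean, is a common single-number summary.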
The potential addition of various features to the data, including school, teacher, extracurricular activities, parent household income, and more, can provide valuable insights for analyzing student performance. Furthermore, creating a pass/fail variable based on score data introduces a binary assessment dimension that simplifies the understanding and analysis of student outcomes.
By incorporating these additional features and creating the pass/fail variable, educators and researchers can gain valuable insights into student performance, make informed decisions, and develop targeted interventions to improve educational outcomes.
Dimensionality reduction is the process of reducing the number of variables or features in a dataset. This is often done to simplify the data and make it easier to analyze or visualize. The two main types of dimensionality reduction are feature selection and feature extraction. Feature selection involves selecting a subset of the original variables, while feature extraction involves transforming the variables into a new set of variables. Principal Component Analysis (PCA) is a common technique used for feature extraction.
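The feature-extraction idea behind PCA can be sketched in a few lines of numpy on synthetic data (center the features, take an SVD, and project onto the leading components; this is an illustrative sketch, not the analysis code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 samples, 5 numeric features (synthetic)

Xc = X - X.mean(axis=0)         # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = S**2 / np.sum(S**2)  # variance ratio per component, descending
X_reduced = Xc @ Vt[:2].T        # project onto the top 2 components
print(explained.round(3), X_reduced.shape)
```

The explained-variance ratios show how much structure the retained components keep, which is the criterion for deciding how far to reduce.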
PCA was attempted, but, as expected, it was not very useful. PCA is most appropriate when dealing with many numerical features; while one-hot encoding the categorical variables would put the data in "numerical" form, that does not make PCA a more appropriate choice.
MCA is a more appropriate technique to implement here and serves as a more effective alternative. Before proceeding, it should be noted that, compared to PCA, MCA gives up some of the interpretability we would have had with PCA-derived variables.
import prince
import altair as alt
# allows us to work with the altair functionality and bypass the row limitation
alt.data_transformers.disable_max_rows()
df = pd.read_csv('Expanded_data_with_more_features.csv') # read in data
df = df.drop("Unnamed: 0", axis = 1)
# impute missing values with each column's most frequent value (mode)
df = df.apply(lambda x: x.fillna(x.value_counts().index[0]))
# create a function for the mentioned feature above
def pass_fail(score):
    if score >= 70:
        return "Pass"
    else:
        return "Fail"
df["MathResult"] = df["MathScore"].apply(pass_fail)
df["ReadingResult"] = df["ReadingScore"].apply(pass_fail)
df["WritingResult"] = df["WritingScore"].apply(pass_fail)
selected_columns = ["MathResult", "ReadingResult", "WritingResult", "EthnicGroup",
                    "ParentEduc", "LunchType", "TestPrep", "PracticeSport",
                    "ParentMaritalStatus", "TransportMeans", "WklyStudyHours",
                    "IsFirstChild", "Gender"]
df[selected_columns] = df[selected_columns].astype('category')
# set seed
np.random.seed(12)
df_dummy = pd.get_dummies(df[selected_columns])
# Perform (MCA) on the dummy dataframe
mca = prince.MCA(n_components=2, random_state=12)
mca = mca.fit(df_dummy)
# Get the column coordinates from MCA
mca_coordinates = mca.column_coordinates(df_dummy)
# Plot the MCA results with labels for categorical variables
plt.figure(figsize=(10, 8))
plt.scatter(mca_coordinates.iloc[:, 0], mca_coordinates.iloc[:, 1])
# Add labels for categorical variables
for i, (x, y) in enumerate(zip(mca_coordinates.iloc[:, 0], mca_coordinates.iloc[:, 1])):
    plt.text(x, y, mca_coordinates.index[i], ha='center', va='center')
plt.xlabel('Component 0')
plt.ylabel('Component 1')
plt.title('MCA Plot with Categorical Variable Labels')
plt.grid(True)
plt.show()
from adjustText import adjust_text
np.random.seed(12)
df_dummy = pd.get_dummies(df[selected_columns])
# Perform Multiple Correspondence Analysis (MCA) on the dummy dataframe
mca = prince.MCA(n_components=2, random_state=12)
mca = mca.fit(df_dummy)
# Get the column coordinates from MCA
mca_coordinates = mca.column_coordinates(df_dummy)
# Plot the MCA results with labels for categorical variables
plt.figure(figsize=(10, 8))
plt.scatter(mca_coordinates.iloc[:, 0], mca_coordinates.iloc[:, 1], color='blue')
# Define base line length
line_length = 0.1 # adjust as needed
# Calculate the median for x and y
median_x = mca_coordinates.iloc[:, 0].median()
median_y = mca_coordinates.iloc[:, 1].median()
# Variables to hold line ends and text objects for adjustment
line_ends_x = []
line_ends_y = []
texts = []
# Add labels for categorical variables with lines for clarity
for i, (x, y) in enumerate(zip(mca_coordinates.iloc[:, 0], mca_coordinates.iloc[:, 1])):
    label = mca_coordinates.index[i]
    # Adjust line length based on density of points
    density = len([point for point in zip(mca_coordinates.iloc[:, 0], mca_coordinates.iloc[:, 1])
                   if np.sqrt((x - point[0]) ** 2 + (y - point[1]) ** 2) < line_length])
    adjusted_line_length = line_length * density
    if x <= median_x and y >= median_y:    # Quadrant II
        line_end = (x - adjusted_line_length, y - adjusted_line_length)
    elif x > median_x and y >= median_y:   # Quadrant I
        line_end = (x + adjusted_line_length, y - adjusted_line_length)
    elif x <= median_x and y < median_y:   # Quadrant III
        line_end = (x - adjusted_line_length, y + adjusted_line_length)
    else:                                  # Quadrant IV
        line_end = (x + adjusted_line_length, y + adjusted_line_length)
    plt.plot([x, line_end[0]], [y, line_end[1]], color='red')
    line_ends_x.append(line_end[0])
    line_ends_y.append(line_end[1])
    texts.append(plt.text(line_end[0], line_end[1], label, ha='center', va='center'))
# Adjust labels to avoid overlap
adjust_text(texts, x=line_ends_x, y=line_ends_y)
plt.xlabel('Component 0')
plt.ylabel('Component 1')
plt.title('MCA Plot with Categorical Variable Labels')
plt.grid(True)
plt.show()
Multiple Correspondence Analysis (MCA) is a statistical visualization technique used for categorical variables.
MCA plots components and highlights possible relationships and patterns in the data. Each point on the plot represents a level of a categorical variable, and the distance between these points represents the relationships between them.
Some key insights we can obtain from an MCA plot are:
Proximity of Points: Points (or labels representing levels of categorical variables) close to each other on the plot share similar profiles, which means they have a similar distribution across the categorical variables in the dataset.
Proximity to Origin: Points located near the origin do not contribute much to the overall variability in the dataset. In contrast, points located far from the origin are particularly characteristic of the respective dimension they're located on.
Dimensions: The axes (dimensions) represent the underlying structure of the data. The first dimension (axis) explains the most variance in the dataset, the second dimension explains the second most variance, and so on.
Explained Variance: The total variance explained by the plot can be obtained. This gives us an idea of how much of the total variability of the dataset is represented in the plot.
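The explained-variance idea carries over to MCA: given the component eigenvalues (hypothetical values below; the prince library exposes them, though the attribute name varies across versions), each axis's share of total inertia is just its eigenvalue over the sum:

```python
import numpy as np

# Hypothetical MCA eigenvalues for the first few components
eigenvalues = np.array([0.18, 0.11, 0.05, 0.03])

# share of total inertia explained by each axis
explained_inertia = eigenvalues / eigenvalues.sum()
print(explained_inertia.round(3))
```

A plot built on the first two axes is only as informative as their combined share of inertia, so this check belongs alongside the plot itself.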
From our plot, we can explore and interpret relationships between variables: for example, locate a response level (such as MathResult_Pass or MathResult_Fail) and use proximity to determine which other variable levels share a similar profile.
Business Understanding:
The dataset consists of scores from three tests along with personal and socio-economic factors, collected with the goal of identifying relationships between these factors and test scores. The objective is to determine the most significant factors influencing test scores, which can help educators, parents, and students focus on variables that can improve academic performance. The information gathered can inform educational policies, practices, and interventions to support struggling students and enhance overall educational outcomes. Understanding and analyzing the dataset can lead to insights regarding student performance factors, teaching practices, and the impact of socio-economic status on academic outcomes. Evaluating the effectiveness of prediction algorithms involves assessing their accuracy in predicting test scores and their interpretability in identifying influential factors. Both accuracy and interpretability are important in evaluating the effectiveness of a prediction algorithm.
Data Understanding:
Missing data:
After thorough consideration and investigation of the missing data in our dataset, we have decided to proceed with imputing the missing values based on frequency. The dataset initially exhibited a substantial number of missing values, amounting to 37.2% of the data, particularly in categorical variables. While the reasons for missingness remain unclear due to limited information on data collection, we have assumed that the missing data is Missing Completely at Random (MCAR).
To handle this missing data, we explored various options such as discarding observations with missing values, which would result in a significant loss of information, or relying on learning algorithms to handle missing values during training. However, we ultimately chose to impute the missing values based on frequency.
By imputing the missing values using the frequency-based approach, we can retain the entirety of our dataset and preserve the sample size, maintaining statistical power and precision. This method involves filling in the missing values with the most frequent category or response in each respective variable. While imputation has its limitations, including potential bias and assumptions, the frequency-based approach is a practical and reasonable choice given the MCAR assumption and the nature of our categorical variables.
It is important to acknowledge that imputation based on frequency may introduce some uncertainty, as it does not account for the variability of the missing values. Nevertheless, this chosen approach strikes a balance between preserving the integrity of the dataset and mitigating the potential biases associated with missing data.
In summary, by opting to impute the missing values based on frequency, we aim to address the missingness in our dataset, maximize the utilization of available data, and facilitate robust analyses that can contribute to informed decision-making and a deeper understanding of the relationships and factors influencing the outcomes of interest.
Top Attributes
The visualizations provide valuable insights into the relationship between different variables and test scores.
1) Gender: The visualizations indicate that females tend to have higher scores in Reading and Writing, while males score higher in Math. This suggests a potential gender-based difference or correlation in the relationship between math and reading/writing scores.
2) Parent Education: The visualizations demonstrate that there is an increasing pattern of median test scores as education levels increase. The boxplots provide a clear summary of the distribution of test scores for different levels of education, allowing for easy comparison and identification of central tendency, spread, and skewness.
3) Ethnic Groups: The visualizations show that the median test scores increase as we move from ethnic groups A to E. The boxplots effectively summarize the distribution of test scores for each ethnic group, enabling comparisons and identification of patterns or differences.
4) Test Prep: The visualizations confirm that completing test prep is associated with higher median scores compared to those who did not participate. The boxplots provide a concise summary of the distribution of test scores, allowing for easy comparison of central tendency, spread, and skewness between the two groups.
5) Lunch Type: The visualizations reveal that students with free/reduced lunch tend to score significantly lower than those with a standard lunch plan. The boxplot summarizes the distribution of test scores and provides insights into the central tendency, spread, and skewness of the data.
Overall, these visualizations are appropriate and effective in conveying the relationships between different variables and test scores, providing a quick summary of the data distribution and highlighting important patterns or differences.